Agent-Routing

Your agent keeps working when OpenAI goes down, your quota runs out, or a model gets deprecated.

Solve Track 04 · Reliability. Agent-Routing is a multi-provider LLM router with per-task fallback cascades, session-level circuit breakers, budget limits, token compression, and prompt-injection filters. It keeps agent systems from freezing when a single API times out. Extracted from 18 months of running Agentic OS in production.

Open source: github.com
Track: Solve 04 · Multi-Provider Routing & Reliability
Runtime: Node.js 18+; offline backup compatible (Ollama)
Providers: OpenAI, Anthropic, Gemini, Ollama
Tests: 13 tests covering failover paths, backoff, and circuit breaker cycles

Multi-Provider LLM Router Flowchart: Annotated Reference

Scan for prompt injection, compress tokens, estimate cost against the budget, run the cascading fallback, and trip breakers on repeated failures.

The problem

Relying on a single LLM API is a single point of failure. If you build an agent system that only speaks to OpenAI, you are at the mercy of rate limits, server outages, billing caps, and network timeouts. When a customer is waiting for a response and OpenAI returns a 503 error, your entire agent pipeline crashes.

Furthermore, different tasks require different models. Running simple text cleanup on GPT-4o is a waste of money; running code writing on a weak model leads to errors. Agent-Routing resolves this by wrapping model requests in a resilient cascade. It routes tasks based on priority chains, tracks errors to temporarily disable down services, and estimates costs beforehand to enforce budget limits.

How it works: step by step

  • Step 1: Task Classification. The developer specifies a task class (e.g. code, content, ui, or simple). The router matches the class to a pre-defined fallback chain. For example, code cascades through OpenAI → Gemini → Anthropic → Ollama, while simple runs Ollama first to keep costs at zero.
  • Step 2: Token Compression. Before dispatch, the prompt is compressed. The system collapses consecutive whitespace and removes redundant boilerplate, yielding 10–20% token savings without degrading performance.
  • Step 3: Security & Budget Checks. The input is scanned for known prompt injection attempts. Simultaneously, the system estimates the token cost based on the target model's pricing. If the cost exceeds the user's budget limit, the call is rejected instantly without touching the API.
  • Step 4: Cascading Execution. The router queries the first provider in the chain. If it receives a network error, a 429 rate limit, or a timeout, it catches the exception, logs it, and immediately forwards the request to the second provider.
  • Step 5: Session Circuit Breaking. Each provider is tracked by a circuit breaker state machine. If a provider fails 3 times in a row, its circuit trips to OPEN. All requests for the next 5 minutes bypass that provider entirely, so no call waits on an endpoint already known to be down.
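
Steps 2 to 4 above can be sketched roughly as follows. This is a minimal illustration, not Agent-Routing's actual source: the function names (compressPrompt, looksLikeInjection, estimateCostUsd, routeWithFallback), the two injection patterns, the 4-characters-per-token heuristic, and the single flat price are all assumptions made for the example. Each provider in the chain is assumed to be an object with a name and an async call(prompt) function.

```javascript
// Illustrative sketch only; names, patterns, and pricing are assumptions,
// not Agent-Routing's real internals.

// Step 2: collapse runs of spaces/tabs and excess blank lines.
function compressPrompt(prompt) {
  return prompt
    .replace(/[ \t]+/g, ' ')
    .replace(/\n{3,}/g, '\n\n')
    .trim();
}

// Step 3a: naive scan for a couple of well-known injection phrasings.
const INJECTION_PATTERNS = [
  /ignore (all )?(previous|prior) instructions/i,
  /reveal (your )?system prompt/i,
];

function looksLikeInjection(prompt) {
  return INJECTION_PATTERNS.some((re) => re.test(prompt));
}

// Step 3b: rough cost estimate. About 4 characters per token is a heuristic,
// and a real router would look up the price of each target model.
function estimateCostUsd(prompt, usdPerMillionTokens) {
  const tokens = Math.ceil(prompt.length / 4);
  return (tokens / 1_000_000) * usdPerMillionTokens;
}

// Step 4: try each provider in order and move on when one throws.
async function routeWithFallback(chain, prompt, { budget, usdPerMillionTokens }) {
  const compressed = compressPrompt(prompt);
  if (looksLikeInjection(compressed)) {
    throw new Error('Rejected: possible prompt injection');
  }
  if (estimateCostUsd(compressed, usdPerMillionTokens) > budget) {
    throw new Error('Rejected: estimated cost exceeds budget');
  }

  const errors = [];
  for (const provider of chain) {
    try {
      const text = await provider.call(compressed);
      return { provider: provider.name, text };
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`); // log, then fall through
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

Both rejections happen before any provider is called, which is what lets the budget check fail fast without touching an API. Circuit breaking (Step 5) is left out here to keep the sketch short.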

Circuit Breaker States

Each provider cycles through three states. This prevents a broken endpoint from constantly adding timeout latency to requests:

  • CLOSED: Healthy state. Requests route normally. Each failed request increments the failure count. When the provider has failed 3 times in a row, the circuit opens.
  • OPEN: Unhealthy state. All requests bypass the provider instantly. The system starts a 5-minute cooldown timer. After the cooldown, the circuit enters HALF_OPEN.
  • HALF_OPEN: Diagnostic state. The system sends a single pilot request. If it succeeds, the circuit returns to CLOSED. If it fails, the circuit opens again and the cooldown restarts.
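
The three states map onto a small state machine. The sketch below is one way to write it, assuming the thresholds described above (3 consecutive failures, 5-minute cooldown). The class name CircuitBreaker, its method names, and the injectable now clock are hypothetical and not taken from the Agent-Routing source.

```javascript
// Illustrative sketch; class, method, and option names are assumptions.
class CircuitBreaker {
  constructor({ failureThreshold = 3, cooldownMs = 5 * 60 * 1000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock so the cooldown can be tested
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  // Returns true if a request may be sent to this provider right now.
  canRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN'; // cooldown elapsed: allow a pilot request
    }
    return this.state !== 'OPEN';
  }

  // A success in any state closes the circuit and clears the count.
  recordSuccess() {
    this.state = 'CLOSED';
    this.failures = 0;
  }

  // A failed pilot, or the Nth consecutive failure, opens the circuit.
  recordFailure() {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now(); // (re)start the cooldown
    }
  }
}
```

A router would call canRequest() before each attempt and skip the provider when it returns false, then report the outcome with recordSuccess() or recordFailure(). This sketch does not stop several concurrent callers from all sending a pilot request while the circuit is HALF_OPEN; a production version would need a guard for that.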

Model Cascades

Default priority chains optimized for pricing, speed, and capability:

Task Class | Cascade Order                                      | Fail-Safe Backup
Code       | gpt-4o → claude-3-5-sonnet → gemini-1.5-pro        | ollama (qwen2.5-coder:7b)
Content    | gemini-1.5-flash → gpt-4o-mini → claude-3-5-haiku  | ollama (llama3:8b)
Simple     | gpt-4o-mini → gemini-1.5-flash                     | ollama (gemma2:2b)
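
Expressed as data, the table above might look like the following. The model names are copied from the table; the CASCADES object shape and the chainFor helper are hypothetical and only illustrate how a task class could resolve to an ordered list of attempts.

```javascript
// Hypothetical config shape; only the model names come from the table.
const CASCADES = {
  code: {
    chain: [
      { provider: 'openai', model: 'gpt-4o' },
      { provider: 'anthropic', model: 'claude-3-5-sonnet' },
      { provider: 'gemini', model: 'gemini-1.5-pro' },
    ],
    backup: { provider: 'ollama', model: 'qwen2.5-coder:7b' },
  },
  content: {
    chain: [
      { provider: 'gemini', model: 'gemini-1.5-flash' },
      { provider: 'openai', model: 'gpt-4o-mini' },
      { provider: 'anthropic', model: 'claude-3-5-haiku' },
    ],
    backup: { provider: 'ollama', model: 'llama3:8b' },
  },
  simple: {
    chain: [
      { provider: 'openai', model: 'gpt-4o-mini' },
      { provider: 'gemini', model: 'gemini-1.5-flash' },
    ],
    backup: { provider: 'ollama', model: 'gemma2:2b' },
  },
};

// Resolve a task class to its full ordered attempt list, backup last.
function chainFor(taskClass) {
  const entry = CASCADES[taskClass];
  if (!entry) throw new Error(`Unknown task class: ${taskClass}`);
  return [...entry.chain, entry.backup];
}
```

Keeping the local Ollama model as a separate backup field makes it explicit that it is the last resort once every hosted provider has failed.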

How to run it

git clone https://github.com/shubham0086/Agent-Routing
cd Agent-Routing
npm install

# Run the failover simulator
node demo/failover.js

The API

import { Router } from 'agent-routing';

const keys = {
    openai: process.env.OPENAI_API_KEY,
    anthropic: process.env.ANTHROPIC_API_KEY,
    gemini: process.env.GEMINI_API_KEY
};

// Second argument: local Ollama endpoint (11434 is Ollama's default port)
const router = new Router(keys, 'http://localhost:11434', 'openai');

// Send a chat task with failover protection and a USD budget limit
const response = await router.chat("Write a fast JSON parser", {
    taskClass: 'code',
    budget: 0.05, // cap this call at $0.05
    temperature: 0.2
});

Where this fits

Agent-Routing is the resilient backend gateway in the autonomy ladder. It implements Pattern 02 (Multi-Provider LLM Routing) in Agentic Patterns. The capstone platform AgentKernel routes all of its API traffic through this client to protect agent routines from timeouts.

Honest framing

A failover cascade hides model-specific format differences. If a task fails on OpenAI and shifts to Gemini, differences in system prompt support or tool call schemas can cause downstream errors. The router normalizes responses into a unified message structure, but when designing complex multi-agent tools, you must test fallback models to ensure they understand your function descriptions. Resiliency requires slightly simpler interfaces.
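
To make the normalization point concrete, here is a rough sketch of mapping each provider's raw response into one shape. The field paths follow the public response formats of the OpenAI Chat Completions API, the Anthropic Messages API, the Gemini generateContent API, and Ollama's /api/chat endpoint. The normalizeResponse function and its { provider, text } output are assumptions for illustration, not Agent-Routing's actual unified message structure.

```javascript
// Sketch only; function name and output shape are assumptions.
function normalizeResponse(provider, raw) {
  switch (provider) {
    case 'openai':
      return { provider, text: raw.choices[0].message.content };
    case 'anthropic':
      // content is a list of typed blocks; keep only the text blocks
      return {
        provider,
        text: raw.content
          .filter((block) => block.type === 'text')
          .map((block) => block.text)
          .join(''),
      };
    case 'gemini':
      return {
        provider,
        text: raw.candidates[0].content.parts.map((p) => p.text ?? '').join(''),
      };
    case 'ollama':
      return { provider, text: raw.message.content };
    default:
      throw new Error(`Unsupported provider: ${provider}`);
  }
}
```

Plain text is the easy case. Tool calls differ far more between providers, which is why fallback models still need to be tested against your own function descriptions.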